In [1]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
pd.options.display.max_rows = 8
Data is often stored in CSV files or databases in so-called “stacked” or “record” format:
In [2]:
df = pd.DataFrame({'subject': ['A', 'A', 'B', 'B'],
                   'treatment': ['CH', 'DT', 'CH', 'DT'],
                   'concentration': range(4)},
                  columns=['subject', 'treatment', 'concentration'])
df
Out[2]:
A better representation might be one where the different subjects are in rows, the applied treatments are in columns, and the outcomes are in the data frame values.
You can achieve this with the pivot method:
In [3]:
pivoted = df.pivot(index='subject', columns='treatment', values='concentration')
pivoted
Out[3]:
If there is more than one record for each pair of "subject" and "treatment" (for example, the subject was tested twice with the same treatment at different times), you can use pivot_table. It works just like pivot, but it additionally lets you specify an aggregation function ('mean' by default).
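For instance, a minimal sketch (the duplicated measurements below are made up for illustration; plain pivot would raise an error on them):
In [ ]:
df2 = pd.DataFrame({'subject': ['A', 'A', 'B', 'B'],
                    'treatment': ['CH', 'CH', 'DT', 'DT'],
                    'concentration': [1.0, 3.0, 2.0, 4.0]})
# duplicate ('subject', 'treatment') pairs are aggregated with the default 'mean'
df2.pivot_table(index='subject', columns='treatment', values='concentration')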
To take another example, we will use some data from expeditions to the Pole of Inaccessibility. We will read the data from an SQL database.
In [4]:
from sqlalchemy import create_engine
engine = create_engine('sqlite:///data/survey.db')
In [5]:
visited = pd.read_sql('Visited', engine, index_col='ident', parse_dates=['dated'])
visited
Out[5]:
In [6]:
readings = pd.read_sql('Survey', engine).dropna()
readings = readings.drop_duplicates()
readings
Out[6]:
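As a sketch of where reshaping takes us (assuming the usual survey.db schema, in which 'Survey' has columns 'taken', 'person', 'quant' and 'reading', and 'taken' refers to the 'ident' of a visit), we could attach the visit date to each reading and spread the measured quantities over columns:
In [ ]:
# look up the date of each visit, then put each measured quantity in its own column
combined = readings.join(visited['dated'], on='taken')
combined.pivot_table(index='dated', columns='quant', values='reading')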
A hierarchical index in pandas is a way of introducing another dimension to a (two-dimensional) data frame. It is implemented by having multiple levels of the index. Let's look at an example.
In [9]:
multi = df.set_index(['subject', 'treatment'])
multi
Out[9]:
Note how the two indexes are nested: the second-level index ('treatment') is grouped under the first-level index ('subject'). To access rows you can use a label from the first level alone, or labels from both levels as a tuple.
In [10]:
multi.loc['A'] # first level only
Out[10]:
Note that this creates a standard data frame with a "flat" index.
In [11]:
multi.loc[('A', 'CH')] # both levels
Out[11]:
Indexing on the second level only is slightly more involved:
In [12]:
multi.loc[(slice(None), 'CH'), :]
Out[12]:
Consult the documentation for other methods.
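For example, the xs method selects a cross-section from a given level; this sketch is equivalent to the slice(None) selection above:
In [ ]:
# select all rows labelled 'CH' in the 'treatment' level
multi.xs('CH', level='treatment')
Note that xs drops the selected level by default (pass drop_level=False to keep it).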
To return to the original format, with columns instead of indexes, use reset_index:
In [13]:
multi.reset_index()
Out[13]:
stack — shifts the last level of hierarchical columns to rows
unstack — does the opposite, i.e. shifts the last level of hierarchical rows to columns
In [15]:
result = multi['concentration'].unstack()
result
Out[15]:
stack reverses the operation:
In [16]:
result.stack()
Out[16]:
We can "stack" it even further:
In [17]:
df = multi.stack()
df
Out[17]:
Going further, let's apply these techniques in a time series case study using data from AirBase (the European Air quality dataBase).
One of the raw data files downloaded from AirBase is included in the repo:
In [19]:
!head -1 ./data/BETR8010000800100hour.1-1-1990.31-12-2012
Just reading the tab-delimited data:
In [20]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t')
data.head()
Out[20]:
The above data is clearly not ready to be used! Each row contains the 24 measurements for each hour of the day, as well as a flag (0/1) indicating the quality of the data.
Let's replace the negative numbers with missing values and give the columns proper names.
In [21]:
hours = map(str, range(24))
flags = ['flag'] * 24
# interleave hour and flag names: ['date', '0', 'flag', '1', 'flag', ...]
col_names = ['date'] + list(sum(zip(hours, flags), ()))
col_names[:5]
Out[21]:
In [22]:
data = pd.read_csv("data/BETR8010000800100hour.1-1-1990.31-12-2012", sep='\t',
                   na_values=['-999', '-9999'],
                   names=col_names,
                   index_col='date')
For now, we disregard the 'flag' columns:
In [23]:
data = data.drop('flag', axis=1)
data.head()
Out[23]:
Now, we want to reshape it: our goal is to have the different hours as row indices, merged with the date into a datetime-index.
We can now use stack and some other functions to create a time series from the original dataframe:
The end result should look like:
| | BETR801 |
|---|---|
| 1990-01-02 09:00:00 | 48.0 |
| 1990-01-02 12:00:00 | 48.0 |
| 1990-01-02 13:00:00 | 50.0 |
| 1990-01-02 14:00:00 | 55.0 |
| ... | ... |
| 2012-12-31 20:00:00 | 16.5 |
| 2012-12-31 21:00:00 | 14.5 |
| 2012-12-31 22:00:00 | 16.5 |
| 2012-12-31 23:00:00 | 15.0 |

170794 rows × 1 columns
First, reshape the dataframe so that each row consists of one observation for one date + hour combination:
In [ ]:
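A possible solution sketch (assuming the hour columns are the strings '0' to '23' defined above; the name 'BETR801' is taken from the station code in the file name):
In [ ]:
data_stacked = data.stack()                             # Series with a (date, hour) MultiIndex
data_stacked = data_stacked.reset_index(name='BETR801') # move the index levels back to columns
data_stacked = data_stacked.rename(columns={'level_1': 'hour'})
data_stacked.head()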
Now, combine the date and hour columns into a datetime (tip: string columns can be summed to concatenate the strings):
In [ ]:
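One way to do this, following the tip (a sketch assuming the dates in the file are formatted as YYYY-MM-DD):
In [ ]:
# concatenate the date and hour strings, then parse them into a DatetimeIndex
combined = data_stacked['date'] + ' ' + data_stacked['hour'] + ':00'
data_stacked.index = pd.to_datetime(combined, format='%Y-%m-%d %H:%M')
data_stacked = data_stacked.drop(['date', 'hour'], axis=1)
data_stacked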
© 2015, Stijn Van Hoey and Joris Van den Bossche (mailto:stijnvanhoey@gmail.com, mailto:jorisvandenbossche@gmail.com).
© 2015, modified by Bartosz Teleńczuk (original sources available from https://github.com/jorisvandenbossche/2015-EuroScipy-pandas-tutorial)
Licensed under CC BY 4.0 Creative Commons